Sample Information

Information related to each single sample have been retrieved by the looking up samples in the GTEx portal as well as by recovering metadata stored in the RangedSummarizedExperiment as shown in the recount2 vignette, making use of recount Bioconductor package. Unfortunately many NAs are present in the metadata.

Brain

Sample Sample GTEX ID Region Death Sex
SRR2167642 GTEX-T6MN-0011-R11A-SM-5S2TH Brain - Cerebellar Hemisphere Actual Death male
SRR607445 GTEX-OHPN-0011-R4A-SM-2I5FD Brain - Amygdala Actual Death female
SRR663753 GTEX-PVOW-2526-SM-2XCF7 Brain - Cortex Actual Death male

Heart

Sample Sample GTEX ID Region Death Sex
SRR1387702 GTEX-PWCY-0526-SM-5PNU8 Heart - Left Ventricle Actual Death female
SRR606939 GTEX-POMQ-0326-SM-2I5FO Heart - Left Ventricle Actual Death female
SRR1468613 GTEX-Y5V6-0826-SM-4VBRU Heart - Left Ventricle Actual Death male

Kidney

Sample Sample GTEX ID Region Death Sex
SRR1469746 GTEX-11GS4-2326-SM-5A5KS Kidney - Cortex Actual Death male
SRR1071807 GTEX-T5JC-1526-SM-4DM68 Kidney - Cortex Actual Death male
SRR1486080 GTEX-1399S-0526-SM-5IJG8 Kidney - Cortex Actual Death female

Data Pre-Processing

Count data were pre-processed in order to remove short genes ( <200 bp) and mitochondrial genes. Genes which are both short and mitochrondrial are considered as short.

Summary Table

Summary table displaying percentages of short genes and mitochondrial genes for each replicate. If a gene is both “short” and on the MT DNA, it only counts towards “short” genes total.

Tot_Reads Tot_Short % Tot_Short Tot_MT % Tot_MT Tissue
SRR2167642 32602806 34505 0.1058344 1810929 5.554519 Brain
SRR607445 35456664 207924 0.5864173 18885940 53.264853 Brain
SRR663753 37607691 76773 0.2041418 7726701 20.545534 Brain
SRR1387702 39505018 91175 0.2307935 10113998 25.601806 Heart
SRR606939 38788919 105861 0.2729156 12467758 32.142577 Heart
SRR1468613 39229440 121991 0.3109680 14936858 38.075634 Heart
SRR1469746 37448629 81223 0.2168918 15420307 41.177227 Kidney
SRR1071807 37288595 83581 0.2241463 15028849 40.304144 Kidney
SRR1486080 37979700 119836 0.3155265 5634358 14.835183 Kidney

Explanatory Analysis

DGEList object

Before moving on with the subsequent analysis, the table obtained above, containg the counts for each gene across the 9 samples, must be converted in a DGEList object. Samples are also labeled according to the tissue they come from (Brain, Heart, Kidney).

An object of class "DGEList"
$counts
                   SRR2167642 SRR607445 SRR663753 SRR1387702 SRR606939
ENSG00000000003.14        126       302       218        148       182
ENSG00000000005.5           1         3         8          1         4
ENSG00000000419.12       1061       229       792        715       561
ENSG00000000457.13        772       196       466        233       293
ENSG00000000460.16        603       102       214         99       137
ENSG00000000938.12         51       250       172        168       127
                   SRR1468613 SRR1469746 SRR1071807 SRR1486080
ENSG00000000003.14        179       1161        619       1095
ENSG00000000005.5           5          4          6         45
ENSG00000000419.12        574        579        589        685
ENSG00000000457.13        263        355        333        723
ENSG00000000460.16        124        172        159        357
ENSG00000000938.12        281        325        124       1368

$samples
            group lib.size norm.factors
SRR2167642  Brain 30757372            1
SRR607445   Brain 16362800            1
SRR663753   Brain 29804217            1
SRR1387702  Heart 29299845            1
SRR606939   Heart 26215300            1
SRR1468613  Heart 24170591            1
SRR1469746 Kidney 21947099            1
SRR1071807 Kidney 22176165            1
SRR1486080 Kidney 32225506            1

Filtering

Given the large amount of genes, it is recommendable to filter out all genes that have zero or really low counts. Also, gene counts will be log-normalized following this step. Count values smaller than 1 will give negative values which may affect gene comparisons. egdeR performs the filtering taking into account also the library size of each sample.

A total of 7,114 genes have zero count. After filtering, only 24,446 were retained for subsequent analysis.

Count Normalization

Counts are normalized according to the egdeR’s default method TMM. TMM normalization is a simple yet robust way to estimate the ratio of RNA production and uses a weighted trimmed mean of the log expression ratios.

An object of class "DGEList"
$counts
                   SRR2167642 SRR607445 SRR663753 SRR1387702 SRR606939
ENSG00000000003.14        126       302       218        148       182
ENSG00000000419.12       1061       229       792        715       561
ENSG00000000457.13        772       196       466        233       293
ENSG00000000460.16        603       102       214         99       137
ENSG00000000938.12         51       250       172        168       127
                   SRR1468613 SRR1469746 SRR1071807 SRR1486080
ENSG00000000003.14        179       1161        619       1095
ENSG00000000419.12        574        579        589        685
ENSG00000000457.13        263        355        333        723
ENSG00000000460.16        124        172        159        357
ENSG00000000938.12        281        325        124       1368
24441 more rows ...

$samples
            group lib.size norm.factors
SRR2167642  Brain 30654722    1.1005668
SRR607445   Brain 16317541    1.0477393
SRR663753   Brain 29702207    1.0970365
SRR1387702  Heart 29281810    0.7231013
SRR606939   Heart 26189859    0.7597152
SRR1468613  Heart 24151973    0.8002918
SRR1469746 Kidney 21908829    1.2789012
SRR1071807 Kidney 22137700    1.1036751
SRR1486080 Kidney 32057855    1.2738922

The logarithm of the normalized counts was also taken. Boxplots below display the distribution of the normalized-log counts for each sample.

Building the Design Matrix

We built a design matrix with no intercept. There is no base condition the tissues can be compared with since we have tissues coming from completely different areas of the body.

           Brain Heart Kidney
SRR2167642     1     0      0
SRR607445      1     0      0
SRR663753      1     0      0
SRR1387702     0     1      0
SRR606939      0     1      0
SRR1468613     0     1      0
SRR1469746     0     0      1
SRR1071807     0     0      1
SRR1486080     0     0      1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

Visualization - Plotting Samples

Samples are plotted in a new two-dimensinal space in order to check whether samples coming from the tissue cluster well together and are far apart from clusters of the other tissues. The new space is defined using dimensionality reduction techniques in order to find components along which the data are more separated, i.e. components having the largest variance. For bulk RNA-seq, Principal Component Analysis (PCA) and Multi Dimensional Scaling (MDS) are mostly used for visualization purposes.

plotMDS

MDS is a non linear transformation which attempts to preserve relative distances. The log2 fold change(log2FC) values among the samples are employed instead of expression counts. Choose option plotMDS-Replicate to see MDS projection with replicates being labeled with corresponding sample name.

Samples belonging to the same tissues seem to be clustering well together. They identify 3 separate clusters, each representing a different tissue.

plotPCA

PCA attempts to find a linear transformation of the data that produces combinations of variables with maximum variation between samples while preserving relative distances. PCA is a special case of MDS.

Samples belonging to the same tissues seem to be clustering well together. They identify 3 separate clusters, each representing a different tissue.

Estimate Dispersion

Counts are assumed to have a Negative Binomial distribution which allows to account for both technical and biological varaibility. Dispersion parameter, which summarizes the biological variability, must be estimated directly from the data since it directly depends on the sample being analysed. Genewise biological coefficient of variation is plotted against the average log CPM (gene abundance). EgdeRalso estimates the common dispersion for all genes (red curve).

[1] 0.328325

The estimated common dispersion parameter has a value of 0.328.

Generalized Linear Model

To find the genes that are differently expressed in the different tissue we need to fit a generalized linear model using the design matrix built above. A generalized linear model is fitted beacuse the gene counts have a negative binomial distribution, whereas, in order to be able to fit a traditional linear model, the outcome of interest (counts) must be normally distributed.

Testing DE Genes

In ordert to test differentially expressed genes, glmQLFTest. Contrasts must be specified. Tissue having contrast equal to 1 is the tissue for which up-regulated genes and down regulated genes are found related to other tissue, contrast of -1. Since multiple tests are carried out simultaneously, the p-value for each test should be adjusted for multiple correction (False Discovery Rate,FDR). The default adjustment method used by egdeR is the Benjamini & Hochberg correction. Different thresholds for the significance level were considered so as to find the more statistically significant genes, i.e the ones for which is less likely that genes being tested come from the same distribution. Genes are considered differentially expressed if the adjusted p-value (FDR) is less than the significance level. A threshold on log2 fold-change (absolute value of log2FC > 1) was also added to make selection more stringent after lowering the significance level (stricter threshold). Log2FC is defined as the log-ratio of TMM values for a pair of expressed genes. A significance level of \(\alpha\) = 0.01 was chosen. Adding a threshold on log2FC did not make any difference on the total amount of DE genes found.

DE Genes

Heart vs. Brain

Genes differentially expressed in Heart tissue with respect to Brain tissue.

       -1*Brain 1*Heart
Down               2056
NotSig            20915
Up                 1475
logFC logCPM F PValue FDR
ENSG00000168280.16 -9.037983 8.605205 372.0951 0e+00 0.0001692
ENSG00000174521.7 -8.092650 4.111864 325.5484 0e+00 0.0001692
ENSG00000267257.1 9.719105 4.687175 271.1180 0e+00 0.0001806
ENSG00000168243.10 -7.641706 4.206588 268.7566 0e+00 0.0001806
ENSG00000110076.18 -7.141565 6.110606 259.4211 0e+00 0.0001806
ENSG00000198796.6 9.689517 6.603624 242.2253 1e-07 0.0001806

Kidney vs. Brain

Genes differentially expressed in Kidney tissue with respect to Brain tissue.

       -1*Brain 1*Kidney
Down                1614
NotSig             21669
Up                  1163
logFC logCPM F PValue FDR
ENSG00000265168.1 -5.660853 5.975233 308.1608 0e+00 0.0002183
ENSG00000168280.16 -7.751925 8.605205 306.2097 0e+00 0.0002183
ENSG00000166589.12 13.427811 7.895476 280.1667 0e+00 0.0002183
ENSG00000174521.7 -6.679155 4.111864 264.2833 0e+00 0.0002183
ENSG00000196104.10 -9.053960 4.820608 230.1301 1e-07 0.0002534
ENSG00000162728.4 -10.381074 6.139604 221.1241 1e-07 0.0002534

Kidney vs. Heart

Genes differentially expressed in Kidney tissue with respect to Heart tissue.

       -1*Heart 1*Kidney
Down                 954
NotSig             22487
Up                  1005
logFC logCPM F PValue FDR
ENSG00000186439.12 -10.016047 6.725510 288.3992 0e+00 0.0001996
ENSG00000265168.1 -5.374136 5.975233 285.1522 0e+00 0.0001996
ENSG00000214097.4 -10.847967 4.524362 284.1654 0e+00 0.0001996
ENSG00000166589.12 11.570320 7.895476 244.5069 1e-07 0.0001996
ENSG00000144035.3 11.442332 5.137774 239.0738 1e-07 0.0001996
ENSG00000113396.12 -9.941702 3.569841 228.8086 1e-07 0.0001996

Saved CVS Files

Question 9

Find the gene up-regulated in Brain tissue vs Heart tissue with the lowest FDR and which is also up-regulated also in Brain tissue vs Kidney tissue.

[1] TRUE
[1] "ENSG00000168280.16"

Limma

Among all different Bioconductor packages that allow you to carry out differential expression analysis, edgeR is the least conservative, i.e. more powerful, thereby more likely to find a statistically significant result. Limma (Linear Models for MicroArrayis) is another package for analysis of gene expression data stemming from analysis of microarrays. In limma, counts (given enough replicates are present) is assumed to be normally distributed, hence only mean and variance must be estimated, no dispersion parameter is needed. Since the distribution of the counts is approximately normal, a traditional linear model can be fitted to the data and a simple t-test can be employed to test whether genes are differentially expressed across the two tissues being compared. Limma bit more conservative than edgeR, so we should obtain a lower number of differentially expressed genes.

Like edgeR, limma uses log-normalized counts per million (TMM). The authors of limma are the same as edgeR, so the packages share some of the steps in the analysis process. They only differ in the model fitted to the gene expression data.

Using the same DGEList object created above for analysis with edgeR without including the part involving the estimation of dispersion parameter.

Design Matrix and Constrat Matrix

Building the desing matrix.

           Brain Heart Kidney
SRR2167642     1     0      0
SRR607445      1     0      0
SRR663753      1     0      0
SRR1387702     0     1      0
SRR606939      0     1      0
SRR1468613     0     1      0
SRR1469746     0     0      1
SRR1071807     0     0      1
SRR1486080     0     0      1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

Building the constrast matrix.

HeartvsBrain KidneyvsBrain KidneyvsHeart
Brain -1 -1 0
Heart 1 0 -1
Kidney 0 1 1

Mean-Variance Trend

voom transformation is applied to the normalized and filtered DGEList object.

Linear Model

To find the genes that are differentially expressed in limma we use a traditional linear model as already discussed above.

Summary

Summary results for all tests performed.

HeartvsBrain KidneyvsBrain KidneyvsHeart
Down 3382 3357 2373
NotSig 17518 17523 19952
Up 3546 3566 2121

Testing DE Genes

Same threshold parameters of the egdeR analysis are used for the selection of signficant differentially expressed genes. Significance level of \(\alpha\) = 0.01 and absolute value of log2FC > 1.

Heart vs. Brain

Genes differentially expressed in Heart tissue with respect to Brain tissue.

logFC AveExpr t P.Value adj.P.Val B
ENSG00000168280.16 -9.044419 4.574942 -25.16726 0 1.57e-05 11.08187
ENSG00000173991.5 9.484989 5.475126 22.82238 0 1.94e-05 11.19243
ENSG00000237298.9 7.840217 6.423091 19.87601 0 2.90e-05 10.68048
ENSG00000121577.13 8.397964 4.414277 19.70534 0 2.90e-05 10.01234
ENSG00000122367.19 7.826212 5.796136 19.61430 0 2.90e-05 10.58620
ENSG00000100170.9 10.085167 2.199401 19.11611 0 2.90e-05 9.20982

Kidney vs. Brain

Genes differentially expressed in Kidney tissue with respect to Brain tissue.

logFC AveExpr t P.Value adj.P.Val B
ENSG00000168280.16 -7.740418 4.5749420 -25.41409 0 1.43e-05 11.883900
ENSG00000166589.12 13.127350 1.1755814 17.72422 0 5.00e-05 8.290850
ENSG00000242366.3 13.126162 -0.6949378 17.56966 0 5.00e-05 8.163250
ENSG00000243135.6 13.124438 -0.6955107 17.55710 0 5.00e-05 8.159920
ENSG00000240224.1 13.124704 -0.6954226 17.55442 0 5.00e-05 8.159204
ENSG00000244474.5 13.134763 -0.6920660 17.55356 0 5.00e-05 8.158920

Kidney vs. Heart

Genes differentially expressed in Kidney tissue with respect to Heart tissue.

logFC AveExpr t P.Value adj.P.Val B
ENSG00000173991.5 -10.284285 5.475126 -23.34910 0 2.75e-05 11.108162
ENSG00000186439.12 -10.022586 2.436215 -20.89037 0 2.75e-05 9.613491
ENSG00000122367.19 -9.411969 5.796136 -20.43170 0 2.75e-05 10.476033
ENSG00000237298.9 -7.754575 6.423091 -19.99943 0 2.75e-05 10.765637
ENSG00000155657.25 -8.873335 7.927762 -19.56746 0 2.75e-05 10.682676
ENSG00000283186.1 -8.567486 7.690718 -19.41187 0 2.75e-05 10.619962

Question 9

Checking if the same gene returned by analysis with limma is the same returned in above analysis.

[1] TRUE
[1] "ENSG00000168280.16"

Indeed, the same gene, ENSG00000168280.16, is returned.